Confidence-based Ensembles of End-to-End Speech Recognition Models
The number of end-to-end speech recognition models grows every year. These
models are often adapted to new domains or languages, resulting in a
proliferation of expert systems that achieve great results on target data
while generally showing inferior performance outside their domain of
expertise. We explore the combination of such experts via confidence-based
ensembles: ensembles of models in which only the output of the most-confident
model is used. We assume that the models' target data is unavailable except for a
small validation set. We demonstrate the effectiveness of our approach with two
applications. First, we show that a confidence-based ensemble of 5 monolingual
models outperforms a system where model selection is performed via a dedicated
language identification block. Second, we demonstrate that it is possible to
combine base and adapted models to achieve strong results on both original and
target data. We validate all our results on multiple datasets and model
architectures.

Comment: To appear in Proc. INTERSPEECH 2023, August 20-24, 2023, Dublin, Ireland
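The selection rule described in the abstract can be sketched in a few lines. This is a hypothetical illustration, not the paper's code: each expert model is assumed to return a transcript together with a scalar confidence (e.g. an aggregated token probability tuned on the small validation set), and the ensemble keeps only the most-confident output.

```python
# Hypothetical sketch of confidence-based ensemble selection: run every
# expert model and keep only the output of the most-confident one.

def confidence_ensemble(models, audio):
    """Return the transcript of whichever model is most confident on `audio`.

    `models` is a list of callables mapping audio -> (transcript, confidence),
    where confidence is a float (e.g. calibrated on a small validation set).
    """
    best_transcript, best_confidence = None, float("-inf")
    for model in models:
        transcript, confidence = model(audio)
        if confidence > best_confidence:
            best_transcript, best_confidence = transcript, confidence
    return best_transcript

# Toy stand-ins for two monolingual "expert" models.
english_model = lambda audio: ("hello world", 0.92)
german_model = lambda audio: ("hallo welt", 0.35)

print(confidence_ensemble([english_model, german_model], audio=None))
```

In the paper's first application, the callables would be five monolingual ASR models, and this argmax-over-confidence replaces a dedicated language-identification block.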
Damage Control During Domain Adaptation for Transducer Based Automatic Speech Recognition
Automatic speech recognition models are often adapted to improve their
accuracy in a new domain. A potential drawback of model adaptation to new
domains is catastrophic forgetting, where the Word Error Rate on the original
domain is significantly degraded. This paper addresses the setting in which we
want to simultaneously adapt automatic speech recognition models to a new
domain and limit the degradation of accuracy on the original domain without
access to the original training dataset. We propose several techniques such as
a limited training strategy and regularized adapter modules for the Transducer
encoder, prediction, and joiner networks. We apply these methods to the Google
Speech Commands and the UK and Ireland English Dialect speech datasets and
obtain strong results on the new target domain while limiting the degradation
on the original domain.

Comment: To appear in Proc. SLT 2022, Jan 09-12, 2023, Doha, Qatar
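A minimal sketch of one ingredient, the regularized adapter module, may help make the idea concrete. This is an assumed illustration, not the paper's implementation: a residual bottleneck adapter is inserted into the network, and an L2 penalty pulls its weights toward zero, i.e. toward the identity mapping of the unadapted base model, which limits forgetting on the original domain.

```python
# Illustrative (assumed) residual bottleneck adapter with an L2 penalty
# that keeps the adapted model close to the frozen base model.

def adapter(hidden, w_down, w_up):
    """Residual adapter: hidden + w_up @ relu(w_down @ hidden)."""
    bottleneck = [max(0.0, sum(w * h for w, h in zip(row, hidden)))
                  for row in w_down]
    update = [sum(w * b for w, b in zip(row, bottleneck)) for row in w_up]
    return [h + u for h, u in zip(hidden, update)]

def l2_penalty(w_down, w_up, strength=0.01):
    """Regularizer pushing adapter weights toward zero, i.e. toward the
    identity mapping of the unadapted base model."""
    return strength * sum(w * w for row in w_down + w_up for w in row)

# With zero adapter weights, the base model's activations pass through
# unchanged -- the adapted model starts identical to the original.
hidden = [1.0, -2.0, 0.5]
w_down = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]   # 2x3 down-projection (toy)
w_up = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]   # 3x2 up-projection (toy)
assert adapter(hidden, w_down, w_up) == hidden
```

In the paper's setting, such adapters sit in the Transducer encoder, prediction, and joiner networks, and the penalty term is added to the training loss.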
Text-only domain adaptation for end-to-end ASR using integrated text-to-mel-spectrogram generator
We propose an end-to-end Automatic Speech Recognition (ASR) system that can
be trained on transcribed speech data, text-only data, or a mixture of both.
The proposed model uses an integrated auxiliary block for text-based training.
This block combines a non-autoregressive multi-speaker text-to-mel-spectrogram
generator with a GAN-based enhancer to improve the spectrogram quality. The
proposed system can generate a mel-spectrogram dynamically during training. It
can be used to adapt the ASR model to a new domain by using text-only data from
this domain. We demonstrate that the proposed training method significantly
improves ASR accuracy compared to the system trained on transcribed speech
only. It also surpasses cascaded TTS systems with a vocoder in both adaptation
quality and training speed.

Comment: Accepted to INTERSPEECH 202
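The core training-loop idea, routing text-only batches through the integrated text-to-mel generator while transcribed speech uses real spectrograms, can be sketched as follows. All names here are hypothetical stand-ins, not the paper's API.

```python
# Schematic (assumed) sketch of mixed speech/text training: transcribed
# speech supplies real mel-spectrograms, while text-only batches are first
# passed through the integrated text-to-mel generator.

def fake_text_to_mel(text):
    """Stand-in for the non-autoregressive text-to-mel generator + GAN
    enhancer: one 80-dim mel frame per character (toy)."""
    return [[0.0] * 80 for _ in text]

def training_step(batch, asr_loss, text_to_mel=fake_text_to_mel):
    if batch.get("mel") is not None:        # transcribed speech data
        mel = batch["mel"]
    else:                                   # text-only data: synthesize mel
        mel = text_to_mel(batch["text"])
    return asr_loss(mel, batch["text"])

toy_loss = lambda mel, text: float(len(mel))  # placeholder loss
loss = training_step({"mel": None, "text": "hi"}, toy_loss)
```

Because the spectrogram is generated on the fly inside the training step, text-only domain data can adapt the ASR model without a separate cascaded TTS-plus-vocoder synthesis pass.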
A Chat About Boring Problems: Studying GPT-based text normalization
Text normalization - the conversion of text from written to spoken form - is
traditionally assumed to be an ill-formed task for language models. In this
work, we argue otherwise. We empirically show the capacity of Large Language
Models (LLMs) for text normalization in few-shot scenarios. Combining
self-consistency reasoning with linguistically informed prompt engineering, we find
that LLM-based text normalization achieves error rates around 40% lower than top
normalization systems. Further, upon error analysis, we note key limitations in
the conventional design of text normalization tasks. We create a new taxonomy
of text normalization errors and apply it to results from GPT-3.5-Turbo and
GPT-4.0. Through this new framework, we can identify strengths and weaknesses
of GPT-based TN, opening opportunities for future work.
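The self-consistency component mentioned above amounts to sampling several candidate spoken-form outputs and keeping the majority answer. The sketch below uses a toy stand-in for the LLM sampler (`sample_llm` is hypothetical, not a real API), with majority voting via `collections.Counter`.

```python
from collections import Counter

# Hedged sketch of self-consistency for text normalization: sample several
# candidate spoken-form outputs from an LLM (stubbed here) and keep the
# majority answer.

def sample_llm(written_text, seed):
    """Toy stand-in for a sampled LLM call: usually consistent, sometimes not."""
    candidates = ["three hundred and fifty dollars",
                  "three hundred and fifty dollars",
                  "three fifty dollars"]
    return candidates[seed % len(candidates)]

def self_consistent_normalize(written_text, n_samples=5):
    samples = [sample_llm(written_text, seed) for seed in range(n_samples)]
    return Counter(samples).most_common(1)[0][0]

print(self_consistent_normalize("$350"))  # majority answer wins
```

Majority voting suppresses the occasional inconsistent sample, which is the intuition behind combining self-consistency with few-shot prompting for this task.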